In jefftwebb/lightbulb: Tutorials for Learning Data Science

library(learnr)
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE)

set.seed(127)
df <- data.frame(Day = 1:30,
                 A = round(rnorm(30, mean = 300, sd = 120), 2),
                 B = c(98.83, rep(NA, 29)))

A <-  df$A


d <- read_csv("data/flight_delay_clean.csv")

# Initialize empty vector
boot_distribution <- NULL
# Define number of simulations
sims <- 1000

# Subset data
regionex <- filter(d, airline == "RegionEx")
mda <- filter(d, airline == "MDA")

# Calculate the difference in median delays for each iteration
set.seed(123)

for(i in 1:sims){

  boot_regionex <- sample(regionex$delay, replace = T)
  boot_mda <- sample(mda$delay, replace = T)

  # These are both temporary bootstrap samples that will be overwritten/redefined
  # for every loop.

  boot_distribution[i] <- median(boot_regionex) - median(boot_mda)

  # We index the vector, boot_distribution, with [i] in order to accumulate
  # a new simulated difference for each loop iteration.

}

# Inspect the bootstrap distribution of difference in median delay
head(boot_distribution, 20)

Introduction

The bootstrap is a computational method for estimating the standard error (SE) for any test statistic.

Why does that matter, you ask? After all, previous modules covered confidence intervals in some detail, which are explicitly defined in terms of the SE, and in that case we used a formula to estimate the SE: $\frac{s}{\sqrt{n}}$. Why not just keep using that? The answer is that we can, and will, but there are a number of other sample statistics, such as the median, for which there is no theoretically-derived formula to estimate the SE. Furthermore, the formula for SE assumes the sample is normally distributed. If it is not then the formula will not do a good job of estimating the standard deviation of the sampling distribution. To obtain the SE in these situations requires a different approach.

Why do we care about the SE, anyway? Because we use it for inference. 95% CIs offer an informal decision procedure by defining the range of mean values most consistent with the sample.

The value of the bootstrap is that it gives us the ability to simulate a sampling distribution for any statistic from any sample (provided that $n \ge 30$), calculate its standard deviation (which, because it is a sampling distribution, is equivalent to the SE), and define a CI. Thus, we can use the bootstrap to do inference for any sample statistic. This is extremely useful.

The Bootstrap

Bootstrapping, also known as resampling, was invented by Stanford statistician Bradley Efron in the late 1970s. It was one of the first computational (as opposed to analytical) techniques for statistical estimation. The name comes from the phrase "to pull oneself up by the bootstraps."

The bootstrap involves "resampling" from a source dataset ($Z$) multiple times to create "bootstrap samples" ($Z^{1}, Z^{2}, Z^{3}...Z^{B}$), and calculating a sample statistic each time ($\hat\alpha^{1},\hat\alpha^{2},\hat\alpha^{3}...\hat\alpha^{B}$. Each bootstrap sample contains different observations and therefore produces a slightly different sample statistic.

The collection of these sample statistics, called the "bootstrap distribution," approximates a sampling distribution for that statistic, the standard deviation of which is the standard error. That is why the technique is called "bootstrapping." It seems to accomplish the impossible, like lifting yourself by the bootstraps: a sampling distribution is generated without actually going through the expense and trouble of a large scale sampling effort. (Note that the source dataset should have $n \ge 30$ for the bootstrap to provide good estimates.)

Let's illustrate the method using the flight delays example we analyzed in the first module. The question we addressed there was whether on‐time performance for RegionEx was worse than for MDA, as measured by delay. One of the issues in the case was to select the appropriate metric for comparison. The mean was problematic because of outliers in RegionEx delays, apparent in the density plot:

# Plot delay by airline
ggplot(d, aes(delay, col = airline))+
  geom_density() +
  theme_minimal()+
  labs(title = "Arrival delay in minutes by airline")

It turned out, curiously, that RegionEx had higher average delays but lower median delays.

# Calculate summary statistics
d %>% 
  group_by(airline) %>% 
  summarize(n = n(),
            mean = mean(delay, na.rm = T) %>%  round(2),
            median = median(delay, na.rm = T) %>%  round())

In general, the median is the appropriate measure of central tendency when there are outliers. In this case we can see that the medians are different. But is the difference statistically significant? This is a question best addressed by bootstrapping, since the formula for computing an SE for the median, in order to do statistical inference, is obscure, requires large $n$ to be accurate, and depends on strong normality assumptions. One of the advantages of the bootstrap is its generality: a single method can be used to compute SEs for any sample statistic.

The null hypothesis, $H_0$, is that MDA and RegionEx delays come from the same population and that the observed difference is due to random sampling variation. Our inferential strategy will be to simulate a sampling distribution of median delay difference using the bootstrap. Assuming $H_0$, the resulting bootstrap distribution should be centered at 0 and the 95% CI should include 0. If it does not then we can reject the null hypothesis.

Bootstrap steps

Initialize an empty vector to store the bootstrap distribution of delay differences. This vector will be a simulated sampling distribution.

# Initialize empty vector
boot_distribution <- NULL

Define a for loop, in each iteration of which we will calculate and store a simulated difference in median delay. These stored delay differences will be the bootstrap distribution. Because we will be using a random process inside the loop we also need to set the seed, and should also define the number of simulations. We will also divide the data into RegionEx and MDA subsets before the loop since dividing it once is much faster than dividing it within each loop i times.

# Initialize empty vector
boot_distribution <- NULL

# Define number of simulations
sims <- 1000

# Subset data
regionex <- filter(d, airline == "RegionEx")
mda <- filter(d, airline == "MDA")

# Set up the for loop
set.seed(123)
for(i in 1:sims){

}

The number of iterations is somewhat arbitrary. Generally 1000 iterations should be sufficient.

Within the loop, for each iteration, take a bootstrap sample of the original subsetted data and calculate the median, storing it in the initialized vector. How do we take a bootstrap sample? Simply use the sample() function with replace = TRUE. (The dplyr functions, sample_frac() and sample_n() work well too, especially when the object to be sampled is a data frame.)

# Initialize empty vector
boot_distribution <- NULL

# Define number of simulations
sims <- 1000

# Subset data
regionex <- filter(d, airline == "RegionEx")
mda <- filter(d, airline == "MDA")

# Calculate the difference in median delays for each iteration
set.seed(123)

for(i in 1:sims){

  boot_regionex <- sample(regionex$delay, replace = T)
  boot_mda <- sample(mda$delay, replace = T)

  # These are both temporary bootstrap samples that will be overwritten/redefined
  # for every loop.

  boot_distribution[i] <- median(boot_regionex) - median(boot_mda)

  # We index the vector, boot_distribution, with [i] in order to accumulate
  # a new simulated difference for each loop iteration.

}

# Inspect the bootstrap distribution of difference in median delay
head(boot_distribution, 20)

boot_distribution <- NULL

sims <- 1000

regionex <- filter(d, airline == "RegionEx")
mda <- filter(d, airline == "MDA")

set.seed(123)
for(i in 1:sims){
  boot_regionex <- sample(regionex$delay, replace = T)
  boot_mda <- sample(mda$delay, replace = T)
  boot_distribution[i] <- median(boot_regionex) - median(boot_mda)
}

# Visualize the bootstrap distribution of medians
  ggplot(data.frame(median = boot_distribution), aes(median)) +
  geom_histogram() +
  theme_minimal()+
  labs(title = "Bootstrap distribution of difference in sample medians")

Use the quantile() function to calculate a 95% CI using the bootstrap distribution.

# Define CI by 2.5th and 97.5th percentiles
quantile(boot_distribution, probs = c(.025, .975))

This is the so-called "percentile method" for calculating a CI. Another option, which produces a similar but more robust result, would be to use the standard deviation of the bootstrap distribution---equivalent to the SE---to compute the CI analytically, as follows:

# Calculate lower and upper bounds for 95% CI
mean(boot_distribution) - 1.96 * sd(boot_distribution)
mean(boot_distribution) + 1.96 * sd(boot_distribution)

The results are very similar.

Why have we used 1.96? As a simulated sampling distribution, the bootstrap distribution will be normally distributed (roughly) due to the Central Limit Theorem. We can therefore exploit the properties of the normal distribution, using the region defined by $\pm$ approximately 2 standard deviations around the mean to define the central 95% of the data.

Back to our question. The 95% CI for median delay difference does not include 0. We can therefore confidently reject the null hypothesis and conclude that median delay for MDA is significantly higher than for RegionEx (even as RegionEx's mean delay was higher than MDA's).

Do we need a p-value? Not really. The CI provides that information and more. But computing a p-value is straightforward using the bootstrap distribution.

# Calculate p-value using bootstrap distribution
sum(boot_distribution >= 0) / sims

This tells us that median delay difference of 0---our null hypothesis---would essentially never happen in the bootstrap distribution.

Bootstrap Details

Sampling with replacement

Sampling with replacement means that before every draw all of the items in the source dataset are replaced and are therefore available for being drawn again. Here is an example:

sample(c("A", "B", "C", "D", "E"), replace = T)
sample(c("A", "B", "C", "D", "E"), replace = T)
sample(c("A", "B", "C", "D", "E"), replace = T)
sample(c("A", "B", "C", "D", "E"), replace = T)

Notice that because items are returned to the source data before every draw, they can be oversampled. Contrast this with sampling without replacement:

sample(c("A", "B", "C", "D", "E"), replace = F)
sample(c("A", "B", "C", "D", "E"), replace = F)
sample(c("A", "B", "C", "D", "E"), replace = F)
sample(c("A", "B", "C", "D", "E"), replace = F)

Here there is no oversampling because sampled items are never replaced. Each sample therefore repeats the source data exactly, but with the order shuffled.

For loops

In general in R programming it is best to avoid loops, as they can be quite slow. However, bootstrapping is one task that is most easily accomplished with a loop (though vectorized methods do exist).

In a for loop, the index or counter for each iteration is defined in the first line of code: for(i in 1:n). Here the index is i (but we could choose any string, such as for(reps in 1:n)). In the first loop i will be 1, in the second 2, and so on. If we want to store information generated in each loop, then we need to initialize an empty vector or data frame beforehand. In the above code we indexed the initialized vector, bootstrap_distribution, with i---`bootstrap_distribution[i]---so that each loop adds a new value to the vector up to n. Here is a simple example.

for(i in 1:10){
  print(i)
}

Setting the seed

Setting the seed once before the loop will suffice to make the results reproducible. That is because R will automatically change the seed with each new random process (each bootstrap sample in each loop in this case) but will do so deterministically. So the seed for each loop will not be the value set originally---if it were there would be no variation---but will be deterministically related to the one before. Consider:

set.seed(123)
sample(c("A", "B", "C", "D", "E"), replace = T)
sample(c("A", "B", "C", "D", "E"), replace = T)

set.seed(123)
sample(c("A", "B", "C", "D", "E"), replace = T)
sample(c("A", "B", "C", "D", "E"), replace = T)

The results are identical, and therefore reproducible, even though we have not explicitly set the seed for the second random process.

Bootstrap SEM

We have seen how to use the bootstrap to calculate the SE for a sample median. In the case of the median, there is no alternative estimation method, so we cannot compare the performance of the bootstrap to an accepted benchmark. In order to convince you that the method works, let's now use the bootstrap to calculate SEs for RegionEx and MDA mean flight delay (technically, SEMs), and compare the results to the following CIs estimated analytically:

# Calculate summary statistics
d %>% 
  group_by(airline) %>% 
  summarize(n = n(),
            mean = mean(delay),
            se = sd(delay)/sqrt(n),
            lowerCI = mean - 1.96 * se,
            upperCI = mean + 1.96 * se)

Note that the implementation will be a little different because our goal is to replicate the CIs for mean delay for each airline.

Here is the bootstrapped CI for RegionEx mean delay.

# Initialize empty vector
boot_distribution <- NULL

# Define number of simulations
sims <- 1000

# Subset data
regionex <- filter(d, airline == "RegionEx")

# Calculate the RegionEx mean for each iteration
set.seed(123)
for(i in 1:sims){
  boot_distribution[i]<- sample(regionex$delay, replace = T) %>%  mean
}

quantile(boot_distribution, probs = c(.025, .975))

# Initialize empty vector
boot_distribution <- NULL

# Define number of simulations
sims <- 1000

# Subset data
mda <- filter(d, airline == "MDA")

# Calculate the MDA mean for each iteration
set.seed(123)
for(i in 1:sims){
  boot_distribution[i]<- sample(mda$delay, replace = T) %>%  mean
}

quantile(boot_distribution, probs = c(.025, .975))

The resulting CIs are very close to those we estimated analytically, using the formulas.

Bootstrap difference in means

As we've noted, assessing airline performance in terms of average delay is misleading, due to the outliers in RegionEx delays. Nevertheless, we can bootstrap a confidence interval for the average difference in delay and compare it to the t-test interval for comparison. Here are the results of t-test:

t.test(filter(d, airline == "RegionEx")$delay,
       filter(d, airline == "MDA")$delay)

The difference in average delay is statistically significant. The confidence interval for the difference is [1.07, 8.46].

Here is the bootstrap estimate of the same quantity:

# Initialize empty vector
boot_distribution <- NULL

# Define number of simulations
sims <- 1000

# Subset data
mda <- filter(d, airline == "MDA")
regionex <- filter(d, airline == "RegionEx")

# Calculate mean difference for each iteration
set.seed(123)
for(i in 1:sims){
  mda_mean <- sample(mda$delay, replace = T) %>%  mean
  regionex_mean <- sample(regionex$delay, replace = T) %>%  mean
  boot_distribution[i]<- regionex_mean - mda_mean
}

quantile(boot_distribution, probs = c(.025, .975))

The results are very similar, but---and this is the advantage of the bootstrap---do not require any complicated formulas for calculating t-statistics or standard errors.

Linear Regression

In the introduction to this tutorial, we noted that the bootstrap is a method for estimating standard errors for any test statistic. Could it be used with regression? Yes. We can easily bootstrap standard errors and confidence intervals for regression coefficients, using the steps introduced earlier.

Initialize a vector for storing the regression coefficient of interest.
Define a for loop (making sure to set the seed for reproducibility).
Within each loop take a bootstrap sample of the data, fit the regression, and extract and store the desired regression coefficient or predictive difference.
Compute the confidence interval for the quantity of interest and assess statistical significance.

jefftwebb/lightbulb documentation built on March 28, 2023, 12:35 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com